In this part you'll implement a small comparative-analysis project, heavily based on the materials from the tutorials and homework.
project/ directory. You can import these files here, as we do for the homeworks. You can read more about the dataset here: https://github.com/pedropro/TACO
and explore the data distribution and how to load it here: https://github.com/pedropro/TACO/blob/master/demo.ipynb
The stable version of the dataset, which contains 1500 images and 4787 annotations, exists in datasets/TACO-master
You do not need to download the dataset.
(you need to install it..)
You can review good models for the COCO object-detection task as a reference: SOTA: https://paperswithcode.com/sota/object-detection-on-coco Real-Time: https://paperswithcode.com/sota/real-time-object-detection-on-coco Or you can use older models like YOLOv3 or Faster R-CNN
Good luck!
In this project we fine-tuned and compared the performance of 3 well-known object detection models - DETR, YOLO and Faster R-CNN. Each model has a different architecture and represents a different learning strategy. Our goal is to compare their performance on the TACO dataset and derive insights on each one of them separately. Below we present this process.
%load_ext autoreload
%autoreload 2
import sys
import json
sys.path.append('/home/ilay.kamai/mini_project/detr')
from PIL import Image as pil_image
First, we explored the data - images and bounding boxes. We did it in two ways: manually, and using exploration functions. The manual inspection gave a lot of insights and in particular showed that there are 2 problems with the annotations:
We fixed those problems using the methods "filter_anns" and "fill_anns" in the project/utils.py file. We also split the train annotations into validation and train sets. The file project/utils.py contains the data exploration methods. After that, we reformatted the dataset to fit the APIs we worked with. The file project/modify_dataset.py creates a data structure that fits the COCO format; we used it for the DETR and Faster R-CNN fine-tuning. The file project/yolo/create_taco_data.py creates the file structure for the YOLO model.
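To illustrate the train/validation split step, here is a minimal sketch of splitting a COCO-format annotation file by image. This is illustrative only; the actual split logic lives in project/utils.py, and the function name below is our own:

```python
import json
import random

def split_coco_annotations(ann_file, val_ratio=0.2, seed=0):
    """Split a COCO-format annotation file into train/val parts by image.

    Illustrative sketch only -- not the actual project/utils.py code.
    """
    with open(ann_file) as f:
        coco = json.load(f)
    images = coco['images'][:]
    random.Random(seed).shuffle(images)
    n_val = int(len(images) * val_ratio)
    val_ids = {im['id'] for im in images[:n_val]}

    def subset(keep_val):
        # Keep images on one side of the split, plus their annotations.
        imgs = [im for im in images if (im['id'] in val_ids) == keep_val]
        ids = {im['id'] for im in imgs}
        anns = [a for a in coco['annotations'] if a['image_id'] in ids]
        return {**coco, 'images': imgs, 'annotations': anns}

    return subset(False), subset(True)
```

Splitting by image (rather than by annotation) keeps all boxes of an image on the same side of the split, which avoids leakage between train and validation.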
from project.utils import anns_hist
for s in ['train', 'val', 'test']:
    anns_hist(f"annotations_{s}.json")
Number of images: 946
Number of images with no annotations: 0
annotations DataFrame:
Categories Number of annotations inv_density weights
0 metals_and_plastic 1520 1.986193 0.000997
2 non_recyclable 848 3.558304 0.001785
6 unknown 306 9.840391 0.004937
3 glass 185 16.241935 0.008149
4 paper 155 19.365385 0.009716
5 bio 6 431.571429 0.216537
1 other 1 1510.500000 0.757878
Number of images: 237
Number of images with no annotations: 0
annotations DataFrame:
Categories Number of annotations inv_density weights
0 metals_and_plastic 408 1.970660 0.001742
2 non_recyclable 241 3.330579 0.002944
6 unknown 71 11.194444 0.009895
4 paper 53 14.925926 0.013194
3 glass 31 25.187500 0.022265
5 bio 2 268.666667 0.237490
1 other 0 806.000000 0.712470
Number of images: 317
Number of images with no annotations: 0
annotations DataFrame:
Categories Number of annotations inv_density weights
0 metals_and_plastic 1179 2.356780 0.007526
6 unknown 741 3.747978 0.011969
2 non_recyclable 602 4.611940 0.014727
4 paper 118 23.369748 0.074627
3 glass 96 28.670103 0.091553
1 other 28 95.896552 0.306229
5 bio 17 154.500000 0.493369
After exploring the data, we moved on to training the first model.
pil_image.open('imgs/detr_architecture.png')
To finetune DETR, we used this fork of the original DETR repo: https://github.com/woctezuma/detr/tree/finetune/models. We modified this repo with the following main improvements:
We also experimented with focal loss vs. cross entropy loss with weights. In order to find the best hyperparameters we used the optuna package to search for an optimal combination (the code can be found in project/detr/opt.py). Unfortunately, due to memory issues, we were able to run the optimization process only with batch_size=1, which was not very effective, so we did most of the hyperparameter tuning manually. We tried to use the training-set inverse frequencies of the classes as weights, but eventually found that a "harder" weighting (where all classes except the 2 most frequent ones have weight 1) gave better results on the validation set.
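As a rough illustration of the two weighting schemes (inverse-frequency vs. the "hard" variant), here is a sketch using hypothetical class counts; the reduced weight for the frequent classes (0.25) is an arbitrary value for illustration, not the one we actually used:

```python
def inverse_frequency_weights(counts):
    """Weight each class by 1 / frequency, normalized to sum to 1."""
    total = sum(counts.values())
    inv = {c: total / n for c, n in counts.items()}
    s = sum(inv.values())
    return {c: w / s for c, w in inv.items()}

def hard_weights(counts, n_frequent=2, frequent_weight=0.25):
    """All classes get weight 1 except the n_frequent most frequent ones,
    which get a reduced weight (frequent_weight is a hypothetical value)."""
    frequent = sorted(counts, key=counts.get, reverse=True)[:n_frequent]
    return {c: (frequent_weight if c in frequent else 1.0) for c in counts}
```

Either weight vector can then be passed to the classification loss (e.g. as the `weight` argument of a weighted cross entropy) so that errors on rare classes cost more.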
# import project.detr.detr_finetune as detr_finetune
# detr_finetune.finetune()
DETR results -
First, we look at the training-process graphs with focal loss:
im1 = pil_image.open('imgs/detr/mAP_loss_focal.png')
im2 = pil_image.open('imgs/detr/losses_focal.png')
im1.show()
im2.show()
Now we look at the training with cross entropy with weights:
# from IPython.display import display
# from IPython.display import Image
im1 = pil_image.open('imgs/detr/mAP_loss_full.png')
im2 = pil_image.open('imgs/detr/losses_full.png')
im1.show()
im2.show()
Note that to compare between them we need to look at the accuracy (mAP) and not the loss value (as those are different functions). The mAP metric in the above graphs is mAP25-75 - the average over the IoU thresholds [0.25, 0.50, 0.75]. To calculate the mAP we used the torchmetrics package (we also implemented it from scratch in project/detr/detr_predict.py; we got the same results and found the package very easy to use, so we used it). We see that the focal loss mAP is much lower than that of cross entropy with weights. Another interesting observation comes from looking at the specific loss components. When using focal loss, the limiting loss (the one that is overfitting) is the GIoU (generalized intersection over union), while when using cross entropy it is the cross entropy itself (when using focal loss, loss_ce refers to the focal loss). This implies that focal loss indeed improves the label classification, but at the expense of the bounding-box IoU. The fact that cross entropy with weights was better than focal loss can be explained by the data not being highly imbalanced: the improvement for the rare classes that we get from the focal term comes at the expense of accuracy on the frequent classes. This might also be a result of non-optimal parameters when using focal loss.
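In the spirit of the from-scratch implementation in project/detr/detr_predict.py, here is a simplified sketch of single-class average precision at one IoU threshold (non-interpolated area under the precision-recall curve; this helper is illustrative, not the actual project code):

```python
def average_precision(detections, n_ground_truth):
    """AP for one class at one IoU threshold.

    detections: list of (score, is_true_positive) pairs, one per predicted
    box, where is_true_positive was decided earlier by IoU matching against
    ground truth. Sorts by confidence and accumulates the area under the
    precision-recall curve.
    """
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for _score, is_tp in detections:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / n_ground_truth
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

mAP25-75 is then the mean of this quantity over the classes and over the IoU thresholds [0.25, 0.50, 0.75] used for the true-positive matching.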
Next, we trained the model on the entire train set for the same number of epochs and evaluated the accuracy on the test set. The results of the last training are shown below:
im1 = pil_image.open('imgs/detr/test_loss.png')
im2 = pil_image.open('imgs/detr/test_losses.png')
im1.show()
im2.show()
Now let's examine the predictions, first in a more visual way. Below are sample predictions, together with the prediction probability, on top of the real boxes and labels:
for p in ['0', '30', '60', '90']:
    im = pil_image.open('imgs/detr/preds_{}.png'.format(p))
    im.show()
We can see that the predictions are overall really good. The bounding boxes are almost perfectly aligned and the labels are usually correct (but not always).
for a in ['0', '30', '60', '90']:
    print(f'sample_{a}')
    for b in ['0', '1']:
        im = pil_image.open('imgs/detr/attn_{}_{}_attn.png'.format(a, b))
        im.show()
sample_0
sample_30
sample_60
sample_90
Looking at the attention weights, it is clear that there is a connection between the bounding box and the pixels with high attention, meaning that the model learned to focus on the desired objects. Nevertheless, there are examples where the attention weights are high for parts of the image that are not of interest. This can be understood as a remnant of the pretraining phase, where the model was trained on a dataset with many more classes.
pil_image.open('imgs/detr/cls_dist.jpeg')
We see that overall, the prediction distributions fit the real ones. Nevertheless, there is an over-prediction of the third class and an under-prediction of the sixth class, roughly by the same amount. The explanation is the difference in the real class distributions between the train and test sets. We saw at the beginning of the notebook that the frequencies of those classes ('unknown' and 'non_recyclable') differ. Since we trained on the train distribution and tested on the test distribution, we expect this shift to be reflected in the results, and this is exactly what we observe. (Note that at the beginning of the notebook all the annotations are shown, and here only the legal ones; still, we expect to see the effect, at least qualitatively.)
detr_acc_path = 'imgs/detr/test_acc.json'
with open(detr_acc_path, 'r') as json_file:
    data = json.load(json_file)
print(json.dumps(data, indent=4))
{
"map": 0.28570878505706787,
"map_50": 0.3183254599571228,
"map_75": 0.16300788521766663,
"map_small": 0.058816004544496536,
"map_medium": 0.3416192829608917,
"map_large": 0.5460426807403564,
"mar_1": 0.2738052010536194,
"mar_10": 0.35033294558525085,
"mar_100": 0.35442739725112915,
"mar_small": 0.11222628504037857,
"mar_medium": 0.403219074010849,
"mar_large": 0.5912919640541077,
"map_per_class": [
0.48206546902656555,
0.0,
0.31224218010902405,
0.25925925374031067,
0.6606858372688293,
0.0,
-1.0
],
"mar_100_per_class": [
0.5789473652839661,
0.0,
0.47455471754074097,
0.3541666567325592,
0.6879432797431946,
0.03095238097012043,
-1.0
],
"classes": [
1,
2,
3,
4,
5,
7,
8
]
}
Specifically, we see mAP50 of 0.32 and mAP25-75 of 0.29.
Next, we trained a YOLOv8 model. YOLO (You Only Look Once) is a CNN-based architecture that aims to detect and classify objects in images with real-time efficiency. YOLO divides the image into a grid and predicts bounding boxes and class probabilities directly within each grid cell.
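To make the grid idea concrete, here is a toy sketch of turning a cell-relative prediction into absolute image coordinates. This is a simplification for illustration only (YOLOv8's actual head is anchor-free and its decoding is more involved):

```python
def decode_cell_prediction(cell_row, cell_col, pred, grid_size, img_size):
    """Toy YOLO-style decoding: a cell predicts (dx, dy, w, h), where
    (dx, dy) is the box center relative to the cell's top-left corner
    (in cell units) and (w, h) are relative to the whole image.
    Returns (cx, cy, w, h) in pixels for a square image of side img_size.
    """
    dx, dy, w, h = pred
    cell = img_size / grid_size          # side length of one grid cell
    cx = (cell_col + dx) * cell          # absolute box center, x
    cy = (cell_row + dy) * cell          # absolute box center, y
    return cx, cy, w * img_size, h * img_size
```

For example, on a 13x13 grid over a 416-pixel image, a prediction of (0.5, 0.5, ...) in cell (6, 6) decodes to a box centered in the middle of the image.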
im = pil_image.open('imgs/yolo_architecture.jpg')
im.show()
To train YOLO we used the built-in ultralytics train function without modifying the source code at all. Instead, we used the high-level API, which allows controlling various training parameters by passing keyword arguments to the model.train method. Nevertheless, one change we did need to make in order to comply with the ultralytics API was to change the directory structure and the annotation format (from COCO to YOLO format). This was done using the code in project/yolo/create_taco_data.py.
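The core of that format conversion is re-encoding each COCO box [x_min, y_min, width, height] (in pixels) as a YOLO label line "class cx cy w h", with the center coordinates and sizes normalized by the image dimensions. A minimal sketch (illustrative, not the actual project/yolo/create_taco_data.py code):

```python
def coco_box_to_yolo(box, img_w, img_h):
    """COCO [x_min, y_min, w, h] in pixels -> YOLO (cx, cy, w, h) normalized."""
    x, y, w, h = box
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)

def yolo_label_line(category_id, box, img_w, img_h):
    """One line of a YOLO .txt label file: 'class cx cy w h'."""
    cx, cy, w, h = coco_box_to_yolo(box, img_w, img_h)
    return f"{category_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```

In the YOLO layout, each image gets a sibling .txt file containing one such line per annotated object, and the directory structure separates images/ from labels/ for each split.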
# import project.yolo.yolo_finetune as yolo_finetune
# yolo_finetune.finetune()
Below is a comparison between the different model sizes:
nano = pil_image.open('imgs/yolo/results_nano.png')
small = pil_image.open('imgs/yolo/results_small.png')
large = pil_image.open('imgs/yolo/results_large.png')
print('nano:')
nano.show()
print("small:")
small.show()
print("large:")
large.show()
nano:
small:
large:
It can be seen that the difference in performance between small and large is ~1% mAP, while the difference in the number of parameters is ~20M. This is not an ideal trade-off.
Precision and recall on the validation set can be seen below:
pil_image.open('imgs/yolo/PR_curve.png').show()
pil_image.open('imgs/yolo/P_curve.png').show()
pil_image.open('imgs/yolo/R_curve.png').show()
# display(pr, p, r)
The Recall-Confidence and Precision-Confidence graphs show an interesting phenomenon - interpreting both graphs, we see that a higher confidence threshold increases the precision but lowers the recall. This means that as the confidence threshold increases, the model "narrows" and "refines" its predictions (small recall = "narrowing", high precision = "refining"). When the recall reaches 0 the process stops, and this is where we see the straight line in the precision curve (constant precision). We also see that the decrease in recall is much more drastic for the rare labels. This is a result of the class imbalance: since there are fewer samples for rare labels, the model fails to recognize them and focuses on the more frequent labels.
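The mechanism behind these curves can be sketched directly: sweep a confidence threshold over a fixed set of detections and recompute precision and recall at each point. A minimal illustration (assuming the detections were already matched to ground truth):

```python
def precision_recall_at_threshold(detections, n_ground_truth, conf_threshold):
    """Precision and recall when only detections with
    score >= conf_threshold are kept.

    detections: list of (score, is_true_positive) pairs.
    Convention: an empty prediction set has precision 1 and recall 0,
    which produces the flat tail seen in precision-confidence curves.
    """
    kept = [d for d in detections if d[0] >= conf_threshold]
    if not kept:
        return 1.0, 0.0
    tp = sum(1 for _score, is_tp in kept if is_tp)
    return tp / len(kept), tp / n_ground_truth
```

Raising the threshold can only drop detections, so recall is non-increasing in the threshold, while precision tends to rise because low-confidence (mostly false) detections are discarded first.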
for b in ['0', '1', '2']:
    for p in ['labels', 'pred']:
        im = pil_image.open('imgs/yolo/val_batch{}_{}.jpg'.format(b, p))
        print(p, b)
        im.show()
labels 0
pred 0
labels 1
pred 1
labels 2
pred 2
We can see that overall the predictions are not great - there are missing bounding boxes and incorrect classes.
pil_image.open('imgs/yolo/confusion_matrix_normalized.png').show()
The confusion matrix above shows that the model was able to learn only 2 classes ("background" means no class).
pil_image.open('imgs/yolo/PR_curve_test.png').show()
pil_image.open('imgs/yolo/P_curve_test.png').show()
pil_image.open('imgs/yolo/R_curve_test.png').show()
On the test set we got mAP50 of 0.0934 and mAP50-95 of 0.0662.
From the above plots and images we can clearly see the difference between classes - the most represented class has the best performance, and the model over-predicts it even in cases where the real label is different. This is expected under class imbalance, and we expect that modifying the loss to mitigate it (with weights or focal loss) would improve performance. We didn't do it because we found the ultralytics API very complex (when trying to modify functions rather than training the model as is) and we didn't have the time to invest in this route.
Lastly, we moved to training the Faster R-CNN model. Faster R-CNN is a convolution-based model with 2 stages - a region proposal network (RPN) and a Fast R-CNN network. The two parts have different roles: the RPN creates proposals for bounding boxes, and the R-CNN network uses those proposals (in feature-map space) to predict bounding boxes and classes.
im = pil_image.open('imgs/rcnn_architecture.jpg')
im.show()
We used the PyTorch model (torchvision.models.detection) with pretrained weights.
To train the model we used a modified version of the Trainer class from the homework (in project/rcnn/train_rcnn.py). We used the same dataset and augmentations and followed the same training tricks as for DETR (early stopping, weight decay, etc.).
To run the Faster R-CNN model, run the following cell (again, you need to modify the paths to fit your setup):
# import project.rcnn.fasterrcnn as fasterrcnn
# fasterrcnn.finetune()
Below are the training graphs comparing the results of focal loss and smoothed cross entropy on the training and validation sets:
pil_image.open('imgs/rcnn/mAP_loss_all.png').show()
We see that focal loss gives better results than CE. We used focal loss, trained the model on the entire training set, and evaluated it on the test set.
pil_image.open('imgs/rcnn/test_loss.png').show()
Below are sample predictions on the test set:
for p in ['0', '30', '60', '90']:
    im = pil_image.open('imgs/rcnn/preds_{}.png'.format(p))
    im.show()
Looking at the sample predictions, we see that the bounding boxes are usually good. This is probably due to the change in anchor sizes, which enables the model to process small objects, and the NMS (non-maximum suppression) threshold that removes overlaps. However, the label predictions effectively amount to a majority-class classifier: the model almost always predicts the most frequent class. We expected that focal loss would mitigate this, but didn't get good results in that regard. This is interesting by itself, since for DETR we found that cross entropy outperformed focal loss, and here we see the opposite. This might be related to the ability of transformers to distinguish between classes using attention (giving different attention weights to different classes), as we saw in class. A more in-depth hyperparameter tuning might improve this.
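For reference, the NMS step mentioned above can be sketched in a few lines; this is the textbook greedy algorithm, not the (equivalent but vectorized) torchvision implementation we relied on:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes.

    Boxes are visited in decreasing score order; a box is kept only if it
    does not overlap (above the threshold) any already-kept box.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Lowering the IoU threshold makes the suppression more aggressive, trading duplicate detections for possibly-missed adjacent objects.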
The class distributions and mAP results on the test set are:
pil_image.open('imgs/rcnn/cls_dist.jpeg').show()
rcnn_acc_path = 'imgs/rcnn/test_acc.json'
with open(rcnn_acc_path, 'r') as json_file:
    data = json.load(json_file)
print(json.dumps(data, indent=4))
{
"map": 0.08888889104127884,
"map_50": 0.08897058665752411,
"map_75": 0.0887254923582077,
"map_small": 0.0361669659614563,
"map_medium": 0.1666666716337204,
"map_large": 0.20000000298023224,
"mar_1": 0.06625016778707504,
"mar_10": 0.134905144572258,
"mar_100": 0.14788758754730225,
"mar_small": 0.1472356617450714,
"mar_medium": 0.16577060520648956,
"mar_large": 0.1944444477558136,
"map_per_class": [
0.5333333611488342,
0.0,
0.0,
0.0,
0.0,
0.0
],
"mar_100_per_class": [
0.8835088014602661,
0.0,
0.003816793905571103,
0.0,
0.0,
0.0
],
"classes": [
1,
2,
3,
4,
5,
7
]
}
Specifically, we get mAP50 of 0.089 and mAP25-75 of 0.089.
The following graph presents a comparison between all three models based on the test-set mAP50 and the number of parameters:
import matplotlib.pyplot as plt
accs = [32, 9.34, 8.9]
names = ['DETR', 'YOLOv8', 'Faster-RCNN']
params = [41.3, 43.7, 43]
for a, p, n in zip(accs, params, names):
    plt.scatter(p, a, s=p * 5, label=n, alpha=0.5)
plt.title("Accuracy TACO object detection")
plt.xlabel('number of parameters (M)')
plt.ylabel("mAP50")
plt.legend()
plt.show()
When comparing the accuracy of the three models together with the number of parameters, it is clear that DETR is much better than YOLO and Faster R-CNN. All three models share a similar number of parameters, but the accuracy of DETR (which has even fewer parameters than the other two) is much higher. Also, by looking at the sample predictions, we can see that the number of bounding boxes, their locations, and the label confidences are better for DETR than for YOLO and R-CNN. We also observed that this was achieved with minimal training of DETR, which points to the power of attention and the set matching loss. To get even better results with DETR we suggest the following:
Both directions require time and resources that we didn't have during this project.
Relate the number of parameters in a neural network to the over-fitting phenomenon (*). Relate this to the design of convolutional neural networks, and explain why CNNs are a plausible choice for a hypothesis class for visual classification tasks.
(*) In the context of classical under-fitting/over-fitting in machine learning models.
# from torch.nn.functional import log_softmax
# from torch import gather
# import torch
# Input: model, x, y.
# Output: the loss on the current batch.
# logits = model(x)
# log_probs = log_softmax(logits, dim=1)
# gathered_log_probs = gather(log_probs, 1, y.view(-1, 1))
# loss = -torch.mean(gathered_log_probs)
Assume that you want to train a model to perform two tasks: task 1 and task 2. For each such task $i$ you have an already implemented function loss_i = forward_and_compute_loss_i(model,inputs) such that given the model and the inputs it computes the loss w.r.t task $i$ (assume that the computational graph is properly constructed). We would like to train our model using SGD to succeed in both tasks as follows: in each training iteration (batch) -
Note that in the above formulation the gradient is thought of as a concatenation of the gradients w.r.t. all the model's parameters, and $g_1 \cdot g_2$ stands for a dot product.
What parts should be modified to implement the above? Is it the optimizer, the training loop, or both? Implement the above algorithm in a code cell/s below.
Consider the following two-input two-output function: $$ f(x,y) = (x^2\sin(xy+\frac{\pi}{2}), x^2\ln(1+xy)) $$
In each one of the following scenarios decide whether to use RNN based model or a transformer based model. Justify your choice.
Suggest a method for combining VAEs and GANs. Focus on the different components of the model and how to train them jointly (the objectives). Which drawbacks of these models might the combined model overcome? Which not?
Show that $q(x_{t-1}|x_t,x_0)$ is tractable and is given by $\mathcal{N}(x_{t-1};\tilde{\mu}(x_t,x_0),\tilde{\beta_t}I)$ where the terms for $\tilde{\mu}(x_t,x_0)$ and $\tilde{\beta_t}$ are given in the last tutorial. Do so by explicitly computing the PDF.
For both BatchNorm and Dropout analyze the following:
(*): In multi-GPU training, each GPU is associated with its own process that holds an independent copy of the model. In each training iteration, a (large) batch is split among these processes (GPUs), which compute the gradients of the loss w.r.t. the relevant split of the data. Afterwards, the gradients from each process are shared and averaged so that the GD step takes the correct gradient into account and the model copies stay synchronized. Note that the processes are blocked between training iterations.